ABSTRACT

The client computing platform is moving towards a heterogeneous 
architecture consisting of a combination of cores focused on scalar
performance, and a set of throughput-oriented cores. The throughput 
oriented cores (e.g. a GPU) may be connected over both coherent 
and non-coherent interconnects, and have different ISAs. This paper 
describes a programming model for such heterogeneous platforms. 
We discuss the language constructs, runtime implementation, and 
the memory model for such a programming environment. We
implemented this programming environment in an x86 heterogeneous
platform simulator. We ported a number of workloads to our 
programming environment, and present the performance of our 
programming environment on these workloads. 

Categories and Subject Descriptors D.3.3 [Programming
Languages]: Language Constructs and Features - concurrent 
programming structures, patterns, data types and structures. 

General Terms Performance, Design, Languages. 

Keywords heterogeneous platforms, programming model 

1. Introduction 

Client computing platforms are moving towards a heterogeneous 
architecture with some processing cores focused on scalar 
performance, while other cores are focused on throughput 
performance. For example, desktop and notebook platforms may 
ship with one or more CPUs (central processing unit), primarily 
focused on scalar performance, along with a GPU (graphics 
processing unit) that can be used for accelerating highly data parallel 
kernels. These heterogeneous platforms can be used to provide 
significant performance boost on highly parallel non-graphics 
workloads in image processing, medical imaging, data mining[6] 
and other domains[10]. Several vendors have also come out with 
programming environments for such platforms, such as CUDA [11],
CTM [2] and OpenCL [12]. 

Heterogeneous platforms have a number of unique architectural
constraints such as:

• The throughput oriented cores (e.g. a GPU) may be 
connected in both integrated and discrete forms. For 
example, current Intel graphics processors are integrated 
with the chipset. On the other hand, other current GPUs are 
attached in a discrete manner over PCI-Express. A system
may also have a hybrid configuration where a low-power 
lower-performance GPU is integrated with the chipset, and 
a higher-performance GPU is attached as a discrete device. 
Finally, a platform may also have multiple discrete GPU 
cards. The platform configuration determines many
parameters such as bandwidth and latency between the 
different kinds of processors, the cache coherency support, etc.

• The scalar and throughput oriented cores may have 
different operating systems. For example, Intel's upcoming 
Larrabee [16] processor can have its own kernel. This 
means that the virtual memory translation schemes (virtual 
to physical address translation) can be different between the 
different kinds of cores. The same virtual address can be 
simultaneously mapped to two different physical addresses 
- one residing in CPU memory and the other residing in 
Larrabee memory. This also means that the system 
environment (loaders, linkers, etc.) can be different between 
the two. For example, the loader may load the application at 
different base addresses on the different cores. 

• The scalar and throughput oriented cores may have 
different ISAs and hence the same code cannot be run on
both kinds of cores. 

A programming model for heterogeneous platforms must address 
all of the above architectural constraints. Unfortunately, existing 
models such as CUDA and CTM address only the ISA 
heterogeneity by providing language annotations to mark out code 
that must run on GPUs, but do not take other constraints into 
account. For example, CUDA does not address the memory 
management issues between CPU and GPU. It assumes that the 
CPU and GPU are separate address spaces with the programmer 
using separate memory allocation functions for the CPU and the 
GPU. Further, the programmer must explicitly serialize data 
structures, decide on the sharing protocol, and transfer the data back 
and forth. 

In this paper, we propose a new programming model for 
heterogeneous x86 platforms that addresses all the above issues. 
First, we propose a uniform programming model for different 
platform configurations. Second, we propose using a shared memory 
model among all the cores in the platform (e.g. between the CPU 
and Larrabee cores). Instead of sharing the entire virtual address 
space, we propose that only a part of the virtual address space be 
shared to enable an efficient implementation. Finally, like 
conventional programming models we use language annotations to 
demarcate code that must run on the different cores, but improve 
upon conventional models by extending our language support to 
include features such as function pointers. 

We break from existing CPU-GPU programming models and 
propose a shared memory model since it opens up a completely new
programming paradigm that improves overall platform performance. 
A shared memory model allows pointers and data structures to be 
seamlessly shared between the different cores (e.g. CPU and 
Larrabee) without requiring any marshalling. For example, consider 
a game engine that includes physics, AI, and rendering. A shared
memory model allows the physics, AI and game logic to be run on
the scalar cores (e.g. CPU), while the rendering runs on the 
throughput cores (e.g. Larrabee), with both the scalar and 
throughput cores sharing the scene graph. Such an execution model 
is not possible in current programming environments since the scene
graph would have to be serialized back and forth.

We implemented our full programming environment including 
the language and runtime support and ported a number of highly 
parallel non-graphics workloads to this environment. In addition, we 
have ported a full gaming application to our system. Our 
implementation works with different operating system kernels 
running on the scalar and throughput oriented cores. 

Like other environments such as CUDA and OpenCL, our
programming environment is also based on C. This allows us to
target the vast majority of existing and emerging throughput 
oriented applications. But this also implies that we do not support 
the higher level programming abstractions found in languages such 
as X10 [14]. Nevertheless, our environment significantly improves
the programmability of heterogeneous platforms by eliminating 
explicit memory management and serialization. For example, using 
existing models, a commercial gaming application (that was ported 
to our system) required about 1.5 weeks of coding to handle data 
management of each new game feature, with the game itself 
including about a dozen features. The serialization arises since the 
rendering is performed on the Larrabee, while the physics and game 
logic are executed on the CPU. All of this coding (and the 
associated debugging, etc.) goes away in our system since the scene 
graph and associated structures are placed in shared memory and 
used concurrently by all the cores in the platform. 

We ported our programming environment to a heterogeneous 
x86 platform simulator that simulates a set of Larrabee-like 
throughput oriented cores attached as a discrete PCI-Express device 
to the CPU. We used such a platform for 2 reasons. First, we believe 
Larrabee is more representative of how GPUs are going to evolve 
into throughput oriented cores. Second, it poses greater challenges 
due to the heterogeneity in the system software stack as opposed to 
simply ISA heterogeneity. Section 6 presents performance results on 
a variety of workloads. 

To summarize, this paper discusses the design and 
implementation of a new programming model for heterogeneous 
x86 (CPU-Larrabee) platforms. We make the following 
contributions: 

• Provide a shared memory semantics between the CPU and 
Larrabee allowing pointers and data structures to be shared 
seamlessly. This extends previous work in the areas of 
DSM and PGAS languages by providing shared memory 
semantics in a platform with heterogeneous ISA, operating 
system kernels, etc. We also improve application 
performance by allowing user-level communication 
between the CPU and Larrabee. 

• Provide a uniform programming model for different 
platform configurations. 

• Implement the model in a heterogeneous x86 platform and 
show the performance of a number of throughput oriented 
workloads in our programming environment. 

In the rest of the paper, we first provide a brief overview of the 
Larrabee architecture, then discuss the proposed memory model, and 
describe the language constructs for programming this platform. We 
then describe our prototype implementation, and finally present 
performance numbers. 

1.1 Larrabee Architecture 

Larrabee [16] (also referred to as LRB in this paper) is a many-core 
x86 visual computing architecture that is based on in-order cores 
that run an extended version of the x86 instruction set, including 
wide vector processing instructions and some specialized scalar 
instructions. Each of the cores contains a 32 KB instruction cache
and a 32 KB L1 data cache, and accesses its own subset of a
coherent L2 cache to provide high bandwidth L2 cache access. The
L2 cache subset is 256 KB and the subsets are connected by a high 
bandwidth on-die ring interconnect. Data written by a CPU core is 
stored in its own L2 cache subset and is flushed from other subsets, 
if necessary. Each ring data path is 512 bits wide per direction. The 
fixed function units and memory controller are spread across the 
ring to reduce congestion. 

Each core has 4 hyper-threads with separate register sets per thread. 
Instruction issue alternates between the threads and covers cases 
where the compiler is unable to schedule code without stalls. The 
core uses a dual issue decoder and the pairing rules for the primary 
and secondary instruction pipes are deterministic. All instructions 
can issue on the primary pipe, while the secondary pipe supports a 
large subset of the scalar x86 instruction set, including loads, stores, 
simple ALU operations, vector stores, etc. The core supports 64 bit 
extensions and the full Pentium processor x86 instruction set. 
Larrabee has a 16 wide vector processing unit which executes 
integer, single precision float, and double precision float 
instructions. The vector unit supports gather-scatter, masked
instructions, and supports instructions with up to 3 source operands. 

2. Memory Model 

The memory model for our system provides a window of shared 
addresses between the CPU and LRB, such as in PGAS [17] 
languages, but enhances it with additional ownership annotations. 
Any data structure that is shared between the CPU and LRB must be 
allocated by the programmer in this space. The system provides a 
special malloc function that allocates data in this space. Static 
variables can be annotated with a type qualifier to have them 
allocated in the shared window. However, unlike PGAS languages 
there is no notion of affinity in the shared window. This is because 
data in the shared space must migrate between the CPU and LRB 
caches as it gets used by each processor. Also the representation of 
pointers does not change between the shared and private spaces. 

The remaining virtual address space is private to the CPU and 
LRB. By default data gets allocated in this space, and is not visible 
to the other side. We choose this partitioned address space approach 
since it cuts down on the amount of memory that needs to be kept 
coherent and enables a more efficient implementation for discrete 
devices. 

The proposed memory model can be extended in a straightforward 
way to multi-LRB configurations. The window of shared virtual 
addresses extends across all the devices. Any data structure
allocated in this shared address window is visible to all agents, and
pointers in this space can be freely exchanged between the devices. 
In addition, every agent has its own private memory. 



Figure 1: CPU-LRB memory model (regions shown: shared space, LRB space, CPU space)

We propose using release consistency in the shared address
space for several reasons. First, the system only needs to
remember all the writes between successive release points, not the 
sequence of individual writes. This makes it easier to do bulk 
transfers at release points (e.g. several pages at a time). This is 
especially important in the discrete configuration since it is more 
efficient to transfer bulk data (e.g. a 4KB page) over PCI-Express. 
Second, it allows memory updates to be kept completely local till a
release point, which is again important in a discrete configuration. In 
general, the release consistency model is a good match for the 
programming patterns in CPU-GPU platforms since there are 
natural release and acquire points in such programs. For example a 
call from the CPU into the GPU is one such point. Making any of 
the CPU updates visible to the GPU before the call does not serve 
any purpose, and neither does it serve any purpose to enforce any 
order on how the CPU updates become visible, as long as all of 
them are visible before the GPU starts executing. Finally, the
proposed C/C++ memory model [5] can be mapped easily to our
shared memory space. In general, race-free programs do not get
affected by the weaker consistency model of our shared memory 
space, and we did not want to constrain the implementation for the 
sake of providing some guarantees on racy programs. 
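
The first argument above (only the set of writes between release points matters, not their order) can be illustrated with a minimal sketch. It is not the paper's runtime: the names rc_write and rc_release, the 4 KB page size, and the 16-page window are assumptions chosen for illustration, and the two arrays merely stand in for CPU and LRB memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096u  /* assumed page size */
#define NUM_PAGES 16u    /* assumed size of the shared window, in pages */

/* Stand-ins for the local copy and the remote copy of the shared window. */
static uint8_t local_mem[NUM_PAGES * PAGE_SIZE];
static uint8_t remote_mem[NUM_PAGES * PAGE_SIZE];
static uint8_t dirty[NUM_PAGES];  /* 1 if the page was written since the last release */

/* Write locally and remember only which page was touched, not the write order. */
static void rc_write(size_t offset, uint8_t value)
{
    assert(offset < sizeof local_mem);
    local_mem[offset] = value;
    dirty[offset / PAGE_SIZE] = 1;
}

/* Release point: push every dirty page in bulk. Returns the number of pages sent. */
static size_t rc_release(void)
{
    size_t sent = 0;
    for (size_t p = 0; p < NUM_PAGES; p++) {
        if (!dirty[p])
            continue;
        memcpy(remote_mem + p * PAGE_SIZE, local_mem + p * PAGE_SIZE, PAGE_SIZE);
        dirty[p] = 0;
        sent++;
    }
    return sent;
}
```

In this sketch, any number of writes to the same page between two release points costs a single page transfer, and nothing is visible remotely until rc_release runs.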

We augment our shared memory model with ownership rights to 
enable further coherence optimizations. Within the shared virtual 
address window, the CPU or LRB can specify at a particular point in 
time that it owns a specific chunk of addresses. If an address range 
in the shared window is owned by the CPU, then the CPU knows 
that LRB can not access those addresses and hence does not need to 
maintain coherence of those addresses with LRB - for example, it 
can avoid sending any snoops or other coherence information to 
LRB. The same is true of LRB owned addresses. If a CPU owned 
address is accessed by LRB, then the address becomes un-owned 
(with symmetrical behavior for LRB owned addresses). We provide 
these ownership rights to leverage common usage models. For 
example, the CPU first accesses some data (e.g. initializing a data 
structure), and then hands it over to LRB (e.g. computing on the 
data structure in a data parallel manner), and then the CPU analyzes 
the results of the computation and so on. The ownership rights allow
an application to inform the system of this temporal locality and
optimize the coherence implementation. Note that these ownership
rights are optimization hints and it is legal for the system to ignore 
these hints. 
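
The ownership rule described above can be sketched as a small state machine. This is an illustrative model only: the types Agent, Ownership, and SharedRange and the functions range_acquire and range_access are hypothetical names, not part of the system's API.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { AGENT_CPU, AGENT_LRB } Agent;
typedef enum { UNOWNED, OWNED_BY_CPU, OWNED_BY_LRB } Ownership;

/* Hypothetical descriptor for one address range in the shared window. */
typedef struct {
    Ownership state;
} SharedRange;

/* The given agent declares that it owns the range. */
static void range_acquire(SharedRange *r, Agent who)
{
    assert(r != NULL);
    r->state = (who == AGENT_CPU) ? OWNED_BY_CPU : OWNED_BY_LRB;
}

/* Model one access. Returns 0 if the access comes from the current owner
 * (coherence traffic can be skipped), or 1 if coherence must be maintained.
 * An access by the non-owner makes the range un-owned. */
static int range_access(SharedRange *r, Agent who)
{
    assert(r != NULL);
    int by_owner = (r->state == OWNED_BY_CPU && who == AGENT_CPU) ||
                   (r->state == OWNED_BY_LRB && who == AGENT_LRB);
    if (by_owner)
        return 0;
    r->state = UNOWNED;
    return 1;
}
```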

3. Language Constructs 

To deal with the platform heterogeneity we add constructs to C that 
allow the programmer to specify whether a particular data item 
should be shared or private, and to specify whether a particular code 
chunk should be run on the CPU or LRB. 

The first construct is the shared type qualifier, similar to UPC [17], 
which specifies a variable that is shared between the CPU & LRB. 
The qualifier can also be associated with pointer types to imply that 
the target of the pointer is in shared space. For example, one can
write:

shared int var1; // int is in shared space

int var2; // int is not in shared space

shared int* ptr1; // ptr1 points to a shared location

int* ptr2; // ptr2 points to private space

shared int *shared ptr3; // ptr3 points to shared and is shared

It is the programmer's responsibility to tag all data that is shared 
between the CPU & LRB with the shared keyword. The compiler 
allocates global shared variables in the shared memory space, while 
the system provides a special malloc function to allocate data in the
shared memory. The actual virtual address range in each space is
decided by the system and is transparent to the user. Variables with
automatic storage (e.g. stack allocated variables) are not allowed to 
be marked with the keyword shared. 

We use an attribute, attribute(LRB), to mark functions that should
be executed on the LRB. For such functions, the compiler generates
LRB-specific code. When a non-annotated function calls a LRB
annotated function, it implies a call from the CPU to LRB. The 
compiler checks that all pointer arguments have shared type and 
invokes a runtime API for the remote call. Function pointer types 
are also annotated with the attribute notation implying that they 
point to functions that are executed on LRB. Non annotated 
function pointer types point to functions that execute on the CPU. 
The compiler checks type equivalence during an assignment - e.g., a 
function pointer with the LRB attribute must always be assigned the 
address of a LRB annotated function. 

Our third construct denotes functions that execute on the CPU but 
can be called from LRB. One could provide a similar functionality 
by modifying the loader and linker, rather than through a language
construct. These functions are denoted using attribute(wrapper).

We used this in 2 ways. First, many programs link with precompiled 
libraries that can execute on the CPU. The functions in these 
libraries are marked as wrapper calls so that they execute on the 
CPU if called from LRB code. Second, while porting large 
programs from a CPU-only execution mode to a CPU-LRB mode, it 
is very helpful to incrementally port the program. The wrapper
attribute allows the programmer to stop the porting effort at any 
point in the call tree by calling back into the CPU. When a LRB 
function calls a wrapper function, the compiler invokes a runtime 
API for the remote call from LRB to the CPU. Making LRB to CPU 
calls explicit allows the compiler to check that any pointer 
arguments have the shared type. 

We also provide a construct and the corresponding runtime support 
for making asynchronous calls from the CPU to LRB. This allows 
the CPU to avoid waiting for LRB computation to finish. Instead the 
runtime system returns a handle that the CPU can query for 
completion. Since this does not introduce any new design issues, we 
focus mostly on synchronous calls in the rest of this paper.

Data annotation rules 

1. Shared can be used to qualify the type of variables with 
global storage. Shared cannot be used to qualify a variable 
with automatic storage unless it qualifies a pointer's 
referenced type. 

2. A pointer in private space can point to any space. A pointer 
in shared space can only point to shared space but not to 
private space. 

3. Shared cannot be used to qualify a single member of a 
structure or union unless it qualifies a pointer's referenced
type. A structure or union type can have the shared 
qualifier which then requires all fields to have the shared 
qualifier as well. 

The following rules are applied to pointer manipulations:

1. Binary operators (+, -, ==, ...) are only allowed between two
pointers pointing to the same space. The system provides
API functions that perform dynamic checks. When an
integer expression is added to or subtracted from a 
pointer, the result has the same type as the pointer. 
2. Assignment/casting from pointer-to-shared to
pointer-to-private is allowed. If a type is not annotated we assume
that it denotes a private object. This makes it difficult to
pass shared objects to legacy functions since their
signature requires private objects. The cast allows us to
avoid copying between private and shared spaces when 
passing shared data to a legacy function. 

3. Assignment/casting from pointer-to-private to
pointer-to-shared is allowed only through a dynamic_cast. The
dynamic_cast checks at runtime that the pointer-to-shared
actually points to shared space. If the check fails, an error
is thrown and the user has to explicitly copy the data from
private space to shared space. This cast allows legacy
code to efficiently return values.
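
The runtime check behind rule 3 can be sketched in plain C as an address-range test. This is only an illustration under stated assumptions: the shared window is modeled as a single contiguous range, the names set_shared_window and checked_cast_to_shared are hypothetical, and a failed check returns NULL here instead of throwing an error.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bounds of the (assumed contiguous) shared window. */
static uintptr_t window_start;
static uintptr_t window_end;  /* one past the last byte of the window */

/* Record where the shared window lives. */
static void set_shared_window(void *base, size_t size)
{
    assert(base != NULL && size > 0);
    window_start = (uintptr_t)base;
    window_end = window_start + size;
}

/* Return p if it points into the shared window, NULL otherwise.
 * The pointer representation is unchanged, so a successful cast is free. */
static void *checked_cast_to_shared(void *p)
{
    uintptr_t addr = (uintptr_t)p;
    return (addr >= window_start && addr < window_end) ? p : NULL;
}
```

Because pointers have the same representation in both spaces, the check is a pair of comparisons and involves no translation of the pointer value.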

Our casting rules try to strike a compromise between efficiency 
and safety. We found that large applications heavily use
precompiled libraries and the program performed significant 
copying if we did not provide casts. The disadvantage with the 
casting rules is that we can no longer leverage some static properties 
- for example eliding synchronization between CPU and LRB while 
accessing a private object. In future we expect to investigate more
sophisticated type systems such as polymorphic type systems. 

Our language can allow casting between the two spaces (with 
possibly a dynamic check) since our data representation remains the
same regardless of whether the data is in shared or private space. 
Even pointers have the same representation regardless of whether 
they are pointing to private or shared space. Given any virtual 
address V in the shared address window, both CPU and LRB have 
their own local physical address corresponding to this virtual 
address. Pointers on CPU and LRB read from this local copy of the 
address, and the local copies get synced up as required by the 
memory model. 

Privatization and globalization: Shared data can be privatized by 
copying from shared space to the private space. Non-pointer 
containing data structures can be privatized simply by copying the 
memory contents. While copying pointer containing data structures, 
pointers into shared data must be converted to pointers into private 
data. Private data can be globalized by copying from the private 
space to the shared space and made visible to other computations. 
Non-pointer containing data structures can be globalized simply by 
copying the memory contents. While copying pointer containing 
data structures, pointers into private data must be converted to
pointers into shared data (converse of the privatization example).

For example, consider a linked list of nodes in private and shared
space. The type definition for the private linked list is standard: 

typedef struct Node {
    int val; // just an int field
    struct Node* next;
} Node;

The type definition for the shared linked list is shown below. Note
that the pointer to the next node is defined to reside in shared space. 
The user must explicitly declare both the private and shared versions 
of a type. 

typedef struct { 
shared int val; 
shared Node * shared next; 
} shared Node; 



Now the user can explicitly copy a private linked list to shared space 
by using the following: 

myNode = (shared Node*) sharedMalloc ( .. );

// head points to the private linked list
myNode->val = head->val;

myNode->next = (shared Node*) sharedMalloc ( .. );
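
The fragment above copies only the first node. A complete globalization has to walk the list and rewrite every next pointer so that it refers to the copy rather than to private data. The following plain-C sketch shows that traversal; since standard C has no shared qualifier, the shared allocator is replaced by a hypothetical stub (shared_alloc_stub, backed by malloc), and globalize_list is likewise an illustrative name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct Node {
    int val;
    struct Node *next;
} Node;

/* Hypothetical stand-in for the shared-space allocator. */
static void *shared_alloc_stub(size_t size)
{
    assert(size > 0);
    void *p = malloc(size);
    if (p == NULL)
        abort();  /* keep the sketch simple: no recovery path */
    return p;
}

/* Deep-copy a private list into "shared" space. Every next pointer in the
 * copy points at another copied node, never back into the private list. */
static Node *globalize_list(const Node *head)
{
    Node *copy_head = NULL;
    Node **tail = &copy_head;  /* where the next copied node gets linked */
    for (const Node *n = head; n != NULL; n = n->next) {
        Node *c = shared_alloc_stub(sizeof *c);
        c->val = n->val;
        c->next = NULL;
        *tail = c;
        tail = &c->next;
    }
    return copy_head;
}
```

Privatization is the same traversal with the roles of the two allocators swapped.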



Code annotation rules 

1. An attribute(LRB) function is not allowed to call a non-annotated
function. This is to ensure that the compiler knows
about all the CPU functions that can be called from the LRB.

2. An attribute(wrapper) function is not allowed to call into an
attribute(LRB) function. This is primarily an
implementation restriction in our system.

3. Any pointer parameter of a LRB or wrapper annotated 
function must point to shared space. 

The calling rules for functions also apply to function pointers - for
example an attribute(LRB) function pointer called from a
non-annotated function results in a CPU-LRB call. Similarly,
un-annotated function pointers cannot be called from LRB functions.
The runtime API used by the compiler is shown below: 

// Allocate and free memory in the private
// address space. Maps to regular malloc
void* privateMalloc (int);
void privateFree (void*);

// Allocation & free from the shared space.
shared void* sharedMalloc (size_t size);
void sharedFree (shared void *ptr);

// Memory consistency for shared memory
void sharedAcquire ();
void sharedRelease ();

The runtime also provides APIs for mutexes and barriers to allow
the application to perform explicit synchronization. These constructs
are always allocated in the shared area. 

The language provides natural acquire and release points. For 
example, a call from the CPU to LRB is a release point on the CPU
followed by an acquire point on the LRB. Similarly, a return from 
the LRB is a release point on the LRB and an acquire point on the 
CPU. Taking ownership of a mutex and releasing a mutex are 
acquire and release points respectively for the processor doing the 
mutex operation, while hitting a barrier and getting past a barrier are 
release and acquire points as well. 
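
The ordering implied by a synchronous CPU-to-LRB call can be sketched as follows. The sketch only records the order of the consistency events in a trace string; remote_call_sync, log_event, and example_kernel are hypothetical names, and the real runtime would perform memory synchronization at each point instead of logging.

```c
#include <assert.h>
#include <string.h>

static char trace[128];  /* records the order of consistency events */

static void log_event(const char *event)
{
    assert(strlen(trace) + strlen(event) < sizeof trace);
    strcat(trace, event);
}

typedef void (*LrbFunction)(void);

/* Model of a synchronous remote call: release on the caller, acquire on the
 * callee, run the function, then release on the callee and acquire on the
 * caller when the call returns. */
static void remote_call_sync(LrbFunction fn)
{
    log_event("cpu:release ");
    log_event("lrb:acquire ");
    fn();
    log_event("lrb:release ");
    log_event("cpu:acquire ");
}

/* Placeholder for the work done on the throughput cores. */
static void example_kernel(void)
{
    log_event("kernel ");
}
```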

3.1 Ownership constructs 

We also provide an API that allows the user to allocate chunks of 
memory (hereafter called arenas) inside the shared virtual region. 
The programmer can dynamically set the ownership attributes of an 
arena. Acquiring ownership of an arena also acts as an acquire 
operation on the arena, and releasing ownership of an arena acts as 
the release operation on the arena. The programmer can allocate 
space within an arena by passing the arena as an argument to a 
special malloc function. The runtime grows the arena as needed. The
runtime API is shown below:

// Owned by the caller or shared
Arena *allocateArena (OwnershipType type);

// Allocate and free within arena
shared void *arenaMalloc (Arena *, size_t);
void arenaFree (Arena *, shared void *);

// Ownership for arena. If null, changes
// ownership of entire shared area
OwnershipType acquireOwnership (Arena*);
OwnershipType releaseOwnership (Arena*);

// Consistency for arena
void arenaAcquire (Arena *);
void arenaRelease (Arena *);

The programmer can optimize the coherence implementation by 
using the ownership API. For example, in a gaming application,
while the CPU is generating a frame, LRB may be rendering the 
previous frame. To leverage this pattern the programmer can 
allocate two arenas with the CPU acquiring ownership of one arena 
and generating the frame into that arena, while the LRB acquires 
ownership of the other arena and renders the frame in that arena. 
This prevents coherence messages being exchanged between the 
CPU and LRB while the frames are being processed. When the CPU 
and LRB are finished with their current frames, they exchange 
ownership of their arenas so that they continue to work without 
incurring coherence overhead. 
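
The two-arena pattern described above can be simulated in a few lines of plain C. This is a sequential model for illustration only: FrameArena, swap_ownership, and run_pipeline are hypothetical names, each arena simply records which frame it holds, and real code would call the ownership API and run the two sides concurrently.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { OWNER_CPU, OWNER_LRB } Owner;

/* Hypothetical arena descriptor: current owner and the frame stored in it. */
typedef struct {
    Owner owner;
    int frame;  /* -1 means no frame has been generated into this arena yet */
} FrameArena;

/* Exchange ownership of the two arenas at the end of a frame. */
static void swap_ownership(FrameArena *a, FrameArena *b)
{
    Owner tmp = a->owner;
    a->owner = b->owner;
    b->owner = tmp;
}

/* Simulate the pipeline for a number of frames. Returns how many frames the
 * LRB side rendered and stores the id of the last rendered frame. */
static int run_pipeline(int frames, int *last_rendered)
{
    assert(frames >= 0 && last_rendered != NULL);
    FrameArena arenas[2] = { { OWNER_CPU, -1 }, { OWNER_LRB, -1 } };
    int rendered = 0;
    for (int f = 0; f < frames; f++) {
        FrameArena *cpu = (arenas[0].owner == OWNER_CPU) ? &arenas[0] : &arenas[1];
        FrameArena *lrb = (cpu == &arenas[0]) ? &arenas[1] : &arenas[0];
        cpu->frame = f;              /* CPU generates frame f in its arena */
        if (lrb->frame >= 0) {       /* LRB renders the previous frame */
            *last_rendered = lrb->frame;
            rendered++;
        }
        swap_ownership(cpu, lrb);    /* hand the arenas over for the next frame */
    }
    return rendered;
}
```

Within one iteration each side touches only the arena it owns, which is what allows coherence messages to be skipped until the swap.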

4. Code example 

This section illustrates the proposed programming model through a
code example that shows a simple vector addition
(addTwoVectors) being accelerated on LRB.

void addTwoVectors (int* a, int* b, int* c)
{
    for (i = 1 to 64) {
        c[i] = a[i] + b[i];
    }
}

int someApp ( .. )
{
    int *a = malloc ( .. );
    int *b = malloc ( .. );
    int *c = malloc ( .. );
    for (i = 1 to 64) // initialize
    { a[i] = ..; b[i] = ..; c[i] = ..; }
    addTwoVectors (a, b, c);
}

In our programming model, this would be written as:

attribute(LRB) void addTwoVectors (shared int* a, shared int* b, shared int* c)
{
    for (i = 1 to 64) {
        c[i] = a[i] + b[i];
    }
}

int someApp ( .. )
{
    // allocate in shared region
    shared int* a = sharedMalloc ( .. );
    shared int* b = sharedMalloc ( .. );
    shared int* c = sharedMalloc ( .. );
    for (i = 1 to 64) // initialize
    { a[i] = ..; b[i] = ..; c[i] = ..; }
    // compiler converts into remote call
    addTwoVectors (a, b, c);
}

In the above implementation, arrays a, b, c are allocated in shared
space by calling the special malloc function. The remote call
(addTwoVectors) acts as the release/acquire point and causes the
memory region to be synced up between CPU & LRB. Finally, the
programmer can add in the appropriate ownership calls to make the
implementation more efficient. In this particular example we hand
over ownership of the entire shared address space. If the CPU makes
an asynchronous call, then the application would hand over
ownership of specific arenas so that the CPU can perform some
computation concurrently with LRB.

attribute(LRB) void addTwoVectors (shared int* a, shared int* b, shared int* c)
{
    acquireOwnership (NULL);
    for (i = 1 to 64) {
        c[i] = a[i] + b[i];
    }
    releaseOwnership (NULL);
}

int someApp ( .. )
{
    // allocate in shared region
    shared int* a = sharedMalloc ( .. );
    shared int* b = sharedMalloc ( .. );
    shared int* c = sharedMalloc ( .. );
    for (i = 1 to 64) // initialize
    { a[i] = ..; b[i] = ..; c[i] = ..; }
    releaseOwnership (NULL);
    addTwoVectors (a, b, c);
    acquireOwnership (NULL);
}

5. Implementation 

The compiler generates 2 binaries - one for execution on LRB and 
another for CPU execution. We generate two different executables 
since the two operating systems can have different executable 
formats. The LRB binary contains the code that will execute on 
LRB (annotated with the LRB attribute), while the CPU binary 
contains the CPU functions which include all un-annotated and 
wrapper annotated functions. Our runtime library has a CPU and 
LRB component which are linked with the CPU and LRB
application binaries to create the CPU and LRB executables. When 
the CPU binary starts executing, it calls a runtime function that 
loads the LRB executable. Both the CPU and LRB binaries create a 
daemon thread that is used for CPU-LRB communication. 

5.1 Implementing CPU-LRB shared memory 

Our implementation reuses some ideas from software distributed 
shared memory [9] [4] schemes, but there are some significant 
differences as well. Unlike DSMs, our implementation is
complicated by the fact that the CPU and LRB can have different
page tables and different virtual to physical memory translations.
Thus, when we want to sync up the contents of virtual address V
between the CPU and LRB (e.g. at a release point), we may need to
sync up the contents of different physical addresses, (say) P1 on
CPU and P2 on LRB. Unfortunately, the CPU does not have access
to LRB's page tables (hence does not know P2) and LRB cannot
access the CPU's page tables and does not know P1.
We solve this problem by leveraging the PCI aperture in a novel 
way. During initialization we map a portion of the PCI aperture 
space into the user space of the application and instantiate it with a
task queue, a message queue, and copy buffers. When we need to 
copy pages, say from the CPU to LRB, the runtime copies the pages 
into the PCI aperture copy buffers and tags the buffers with the 
virtual address and the process identifier. On the LRB side, the 
daemon thread copies the contents of the buffers into its address 
space by using the virtual address tag. Thus we perform the copy in 
a 2 step process - the CPU copies from its address space into a 
common buffer (PCI aperture) that both CPU and LRB can access, 
while the LRB picks up the pages from the common buffer into its 
address space. LRB-CPU copies are done in a similar way. Note that
since the aperture is pinned memory, the contents of the aperture are 
not lost if the CPU or LRB process gets context switched out. This 
allows the two processors to execute asynchronously which is 
critical since the two processors can have different operating 
systems and hence the context switches can not be synchronized. 
Finally, note that we map the aperture space into the user space of 
the application thus enabling user level CPU-LRB communication.
This makes the application stack vastly more efficient than going
through the OS driver stack. To ensure security, the aperture space is 
partitioned among the CPU processes that want to use LRB. At 
present, a maximum of 8 processes can use the aperture space. 
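
The two-step copy through the aperture can be sketched as follows. The layout is an assumption made for illustration: the aperture is modeled as a small static array of tagged copy buffers, and CopySlot, aperture_send, and aperture_drain are hypothetical names. Real aperture memory would be pinned and mapped into both address spaces, which plain C cannot express.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096
#define NUM_SLOTS 4  /* assumed number of copy buffers in the aperture */

/* One copy buffer, tagged with the process id and the virtual page it holds. */
typedef struct {
    int full;
    int pid;
    size_t vpage;  /* virtual page number within the shared window */
    unsigned char data[PAGE_SIZE];
} CopySlot;

static CopySlot aperture[NUM_SLOTS];  /* stand-in for the PCI aperture */

/* Step 1 (sender): copy a page into a free buffer and tag it.
 * Returns 1 on success, 0 if no buffer is free. */
static int aperture_send(int pid, size_t vpage, const unsigned char *page)
{
    for (int i = 0; i < NUM_SLOTS; i++) {
        if (aperture[i].full)
            continue;
        memcpy(aperture[i].data, page, PAGE_SIZE);
        aperture[i].pid = pid;
        aperture[i].vpage = vpage;
        aperture[i].full = 1;
        return 1;
    }
    return 0;
}

/* Step 2 (receiver daemon): copy every buffer tagged with pid into the
 * receiver's own memory, using the virtual page tag to find the destination.
 * Returns the number of pages copied. */
static int aperture_drain(int pid, unsigned char *mem, size_t num_pages)
{
    int copied = 0;
    for (int i = 0; i < NUM_SLOTS; i++) {
        if (!aperture[i].full || aperture[i].pid != pid)
            continue;
        assert(aperture[i].vpage < num_pages);
        memcpy(mem + aperture[i].vpage * PAGE_SIZE, aperture[i].data, PAGE_SIZE);
        aperture[i].full = 0;
        copied++;
    }
    return copied;
}
```

Neither side needs the other's physical address: the sender knows only its own mapping, and the receiver resolves the virtual page tag against its own mapping.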
We exploit one other difference between traditional software DSMs 
and CPU-LRB platforms. Traditional DSMs were designed to scale 
on medium to large clusters. In contrast, CPU-LRB systems are very 
small scale clusters. We do not expect more than a handful of LRB
cards and CPU sockets to be used well into the future. Moreover, 
the PCI aperture provides a convenient shared physical memory 
space between the different processors. Thus we are able to 
centralize many data structures and make the implementation more 
efficient. For example, we put a directory in the PCI aperture that 
contains metadata about the pages in the shared address region. The 
metadata says whether the CPU or LRB holds the golden copy of a 
page (home for the page), contains a version number that tracks the 
number of updates to the page, mutexes that are acquired before 
updating the page, and miscellaneous metadata. The directory is 
indexed by the virtual address of a page. Both the CPU and the LRB 
runtime systems maintain a private structure that contains local 
access permissions for the pages, and the local version numbers of 
the pages. 
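
One possible shape for the directory and the per-side private structure is sketched below in Python. The field and class names are assumptions made for illustration, the miscellaneous metadata is omitted, and the staleness test is our reading of how the directory and private version numbers would be compared.

```python
from dataclasses import dataclass, field
import threading

CPU, LRB = "CPU", "LRB"


@dataclass
class DirectoryEntry:
    """Shared metadata for one page, kept in the aperture in the real system."""
    home: str = CPU          # which side holds the golden copy
    version: int = 0         # number of released updates to the page
    lock: threading.Lock = field(default_factory=threading.Lock)  # taken before updates


class Directory:
    """Shared directory, indexed by the virtual address of a page."""

    def __init__(self):
        self.entries = {}

    def entry(self, vaddr):
        return self.entries.setdefault(vaddr, DirectoryEntry())


@dataclass
class LocalPageState:
    """Private, per-side state for one page."""
    permission: str = "none"   # "none", "read", or "read-write"
    version: int = -1          # version of the local copy; -1 means never fetched


def is_stale(directory, local, vaddr):
    """A local copy is stale when the directory has recorded a newer release."""
    state = local.get(vaddr, LocalPageState())
    return state.version < directory.entry(vaddr).version
```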



When LRB performs an acquire operation, the corresponding pages 
are set to no-access on LRB. At a subsequent read operation the 
page fault handler on the LRB copies the page from the home 
location if the page has been updated and released since the last 
LRB acquire. The directory and private version numbers are used to 
determine this. The page is then set to read-only. At a subsequent 
write operation the page fault handler creates the backup copy of the 
page, marks the page as read-write and increments the local version 
number of the page. At a release point, we perform a diff with the 
backup copy of the page and transmit the changes to the home 
location, while incrementing the directory version number. The 
CPU operations are done in a symmetrical way. Thus, between 
acquire and release points the LRB and CPU operate out of their 
local memory and communicate with each other only at the explicit 
synchronization points. 
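
The acquire, fault, and release sequence can be traced with a small single-page simulation. The sketch below is a simplified model and not the runtime itself: page protection is replaced by an explicit `perm` field, a write to a no-access page is treated as a read fault followed by a write fault, and a released page is reset to read-only so that the next write makes a fresh backup copy. These last two choices, and all class and method names, are our assumptions.

```python
class Home:
    """Golden copy of one page plus its directory version number."""

    def __init__(self, data):
        self.data = bytearray(data)
        self.version = 0


class LocalPage:
    """One side's copy of the page between acquire and release points."""

    def __init__(self, size):
        self.data = bytearray(size)
        self.version = -1      # local version number; -1 means never fetched
        self.perm = "none"     # "none", "read", or "read-write"
        self.twin = None       # backup copy made at the first write

    def acquire(self):
        self.perm = "none"     # the next access will fault

    def read(self, home, i):
        if self.perm == "none":                 # read fault
            if self.version < home.version:     # updated and released elsewhere
                self.data[:] = home.data
                self.version = home.version
            self.perm = "read"
        return self.data[i]

    def write(self, home, i, value):
        if self.perm == "none":
            self.read(home, i)                  # bring the page up to date first
        if self.perm == "read":                 # write fault
            self.twin = bytes(self.data)        # create the backup copy
            self.perm = "read-write"
            self.version += 1
        self.data[i] = value

    def release(self, home):
        """Diff against the backup copy and send only the changed bytes home."""
        if self.twin is None:
            return 0
        diff = [(i, b) for i, (a, b) in enumerate(zip(self.twin, self.data)) if a != b]
        for i, b in diff:
            home.data[i] = b
        home.version += 1
        self.twin = None
        self.perm = "read"
        return len(diff)
```

In this model, two sides that write different bytes of the same page between synchronization points both see their updates merged at the home location, since each release transmits only its own diff.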

At startup the implementation decides the address range that will be 
shared between CPU and LRB, and makes sure that this address 
range always remains mapped. This address range can grow 
dynamically, and does not have to be contiguous, though in a 64 bit 
address space the runtime system can reserve a contiguous chunk 
upfront. 

5.2 Implementing shared memory ownership 

Every arena has associated metadata that identifies the pages that 
belong to the arena. Suppose LRB acquires ownership of an arena. 
We make the corresponding pages non-accessible on the CPU. We 
copy any arena pages from the home location that have been 
updated and released since the last time the LRB performed an 
acquire operation. We set the pages to read-only so that subsequent 
LRB writes will trigger a fault, and the system can record which 
LRB pages are being updated. In the directory we note that LRB is 
the home node for the arena pages. On a release operation, we 
simply make the pages accessible again on the other side, and 
update the directory version number of the pages. The CPU 
ownership operations are symmetrical. 

Note the performance advantages of acquiring ownership. At a 
release point we no longer need to perform diff operations, and do 
not need to create a backup copy at a write fault, since we know that 
the other side is not updating the page. Second, since the user 
provides specific arenas to be handed over from one side to the 
other, the implementation can perform a bulk copy of the required 
pages at an acquire point. This leads to a more efficient copy 
operation since the setup cost is incurred only once and gets 
amortized over a larger copy. 
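
A rough Python model of arena ownership makes the two savings visible: there is no backup copy or diff, and stale pages arrive in a single bulk transfer. The names `Arena` and `Side` are hypothetical, page protection is not modeled, and counting one bulk copy per acquire is a simplification of the real transfer.

```python
class Arena:
    """A set of shared pages handed between the CPU and LRB as one unit."""

    def __init__(self, pages):
        self.home_data = {v: bytearray(d) for v, d in pages.items()}
        self.version = {v: 0 for v in pages}   # directory version numbers
        self.home = "CPU"
        self.owner = None


class Side:
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.seen = {}        # local version numbers
        self.dirty = set()    # pages written while owning the arena
        self.bulk_copies = 0

    def acquire_ownership(self, arena):
        assert arena.owner is None, "the other side must release first"
        stale = [v for v in arena.home_data
                 if self.seen.get(v, -1) < arena.version[v]]
        if stale:             # one bulk transfer covers every stale page
            for v in stale:
                self.data[v] = bytearray(arena.home_data[v])
                self.seen[v] = arena.version[v]
            self.bulk_copies += 1
        arena.owner = arena.home = self.name

    def write(self, arena, vaddr, i, value):
        assert arena.owner == self.name
        self.dirty.add(vaddr)  # the write fault only records the page
        self.data[vaddr][i] = value

    def release_ownership(self, arena):
        for v in self.dirty:   # no diff: the owner's copy is the golden copy
            arena.home_data[v] = self.data[v]
            arena.version[v] += 1
            self.seen[v] = arena.version[v]
        self.dirty.clear()
        arena.owner = None
```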

5.3 Implementing remote calls 

A CPU-LRB or LRB-CPU remote call is complicated by the fact 
that the two processors can have different system environments, for 
example different loaders. The two binaries are also loaded 
separately and asynchronously. Suppose that the CPU code makes 
some calls into the LRB. When the CPU binary is loaded, the LRB 
binary has still not been loaded and hence the addresses for LRB 
functions are still not known. Therefore, the OS loader can not patch 
up the references to LRB functions in the CPU binary. Similarly, 
when the LRB binary is being loaded, the LRB loader does not 
know the addresses of any CPU functions being called from LRB 
code and hence can not patch those addresses. 
We implement remote calls by using a combination of compiler and 
runtime techniques. Our language rules ensure that any function 
involved in remote calls (LRB or wrapper attribute functions) is 
annotated by the user. When compiling such functions, the compiler 
adds a call to a runtime API that registers function addresses 
dynamically. The compiler creates an initialization function for each 
file that invokes all the different registration calls. When the binary 
gets loaded, the initialization function in each file gets called. The 
shared address space contains a jump table that is populated 
dynamically by the registration function. The table contains one slot 
for every annotated function. The format of every slot is 
<funcName, funcAddr> where funcName is a literal string of the 
function name and funcAddr is the runtime address of the function. 
The translation scheme works as follows. 

1. If a LRB (CPU) function is being called within a LRB (CPU) 
function, the generated code will do the call as is. 

2. If a LRB function is being called within a CPU function, the 
compiler generated code will do a remote call to LRB: 

2.1. The compiler generated code will look up the jump table 
with the function name and obtain the function address. 

2.2. The generated code will pack the arguments into an 
argument buffer in shared space. It will then call a 
dispatch routine on the LRB side passing in the 
function address and the argument buffer address. 

There is a similar process for a wrapper function except that it is a 
remote call to CPU if a wrapper function is called in a LRB code. 
For function pointer invocations, the translation scheme works as 
follows. When a function pointer with LRB annotation is assigned, 
the compiler generated code will look up the jump table with the 
function name and assign the function pointer with obtained 
function address. Although the lookup can be optimized out when 
LRB annotated function pointer is assigned within LRB code, we 
forsake the optimization to use a single strategy for all function 
pointer assignments. If a LRB function pointer is being called within 
a LRB function, the compiler generated code will do the call as is. 
If a LRB function pointer is being called within a CPU function, the 
compiler generated code will do a remote call to LRB side. The 
process is similar for a wrapper function pointer except that there is 
a remote call to CPU side if wrapper function pointer is called in a 
LRB function. 
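
The registration and translation scheme can be approximated in Python as follows. This is a sketch under assumptions: the `register` decorator stands in for the compiler-inserted registration call, Python function objects stand in for runtime addresses, and `call` models the generated call site. Unlike the compiler, which knows the annotation statically, the sketch stores each function's side in the table so it can decide between a local and a remote call. `scale` and `REMOTE_CALLS` exist only for the example.

```python
JUMP_TABLE = {}      # models the shared jump table: funcName -> (side, funcAddr)
REMOTE_CALLS = []    # bookkeeping for this sketch only


def register(side):
    """Models the registration call the compiler adds for an annotated function."""
    def decorator(func):
        JUMP_TABLE[func.__name__] = (side, func)
        return func
    return decorator


def dispatch(side, func, arg_buffer):
    """Dispatch routine on the receiving side: unpack the argument buffer and call."""
    REMOTE_CALLS.append((side, func.__name__))
    return func(*arg_buffer)


def call(caller_side, func_name, *args):
    """Models the code generated at a call site running on caller_side."""
    callee_side, func = JUMP_TABLE[func_name]   # step 2.1: look up by name
    if callee_side == caller_side:
        return func(*args)                      # rule 1: do the call as is
    arg_buffer = tuple(args)                    # step 2.2: pack the arguments
    return dispatch(callee_side, func, arg_buffer)


@register("LRB")
def scale(x, k):
    return x * k
```

Calling `call("LRB", "scale", 3, 2)` takes the local path, while `call("CPU", "scale", 3, 2)` goes through `dispatch`, which is where the real system would hand the function address and argument buffer to the other processor.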

The CPU-LRB signaling happens with task queues in the PCI 
aperture space. The daemon threads on both sides poll their 
respective task queues and when they find an entry in the task 
queue, they spawn a new thread to invoke the corresponding 
function. The API for remote invocations is described below. 



// Synchronous and asynchronous remote calls 
RPCHandler callRemote(FunctionType, RPCArgType); 

int resultReady(RPCHandler); 

Type getResult(RPCHandler); 
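
To illustrate how a caller might use callRemote, resultReady and getResult, the following Python sketch models the task queue and the receiving daemon with standard-library threads. It is an assumption-laden model rather than the actual runtime: `queue.Queue` stands in for the task queue in the aperture, the daemon blocks on the queue instead of polling, the `None` shutdown sentinel is specific to the sketch, and the snake_case names are Python renderings of the API names.

```python
import queue
import threading


class RPCHandler:
    """Handle returned by a remote call; holds the eventual result."""

    def __init__(self):
        self._done = threading.Event()
        self._result = None


task_queue = queue.Queue()   # stands in for the task queue in the PCI aperture


def _run(func, args, handler):
    handler._result = func(*args)
    handler._done.set()


def daemon_loop():
    """Receiving daemon: take entries off the task queue, run each in a new thread."""
    while True:
        entry = task_queue.get()
        if entry is None:            # shutdown sentinel, specific to this sketch
            return
        threading.Thread(target=_run, args=entry).start()


def call_remote(func, args):
    """Enqueue the call and return immediately with a handle."""
    handler = RPCHandler()
    task_queue.put((func, args, handler))
    return handler


def result_ready(handler):
    """Non-blocking check; returns 1 once the result is available, else 0."""
    return int(handler._done.is_set())


def get_result(handler):
    """Block until the remote function has finished, then return its result."""
    handler._done.wait()
    return handler._result
```

A caller that wants synchronous behavior calls `get_result` right after `call_remote`; a caller that wants to overlap work checks `result_ready` and continues with other computation until it returns 1.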

Finally, the CPU and LRB co-operate while allocating memory in 
the shared area. Each processor allocates memory from either side of 
the shared address window. When one processor consumes half of 
the space, the two processors repartition the available space. 
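
One way to read this allocation scheme is as a two-ended allocator with a movable boundary. The Python sketch below follows that reading. The text does not specify how the space is repartitioned, so splitting the remaining free space evenly is our assumption, as are the class and method names. A real implementation would also need the two processors to synchronize around the repartitioning step.

```python
class SharedWindow:
    """Two-ended allocator over a shared address window of the given size."""

    def __init__(self, size):
        self.size = size
        self.cpu_top = 0            # CPU allocates upward from the low end
        self.lrb_bottom = size      # LRB allocates downward from the high end
        self.boundary = size // 2   # each side starts with half of the window
        self.repartitions = 0

    def _repartition(self):
        # Assumed policy: split the space still free between the two ends evenly.
        self.boundary = (self.cpu_top + self.lrb_bottom) // 2
        self.repartitions += 1

    def alloc_cpu(self, n):
        if self.cpu_top + n > self.boundary:
            self._repartition()
        if self.cpu_top + n > self.boundary:
            raise MemoryError("shared window exhausted for CPU")
        addr = self.cpu_top
        self.cpu_top += n
        return addr

    def alloc_lrb(self, n):
        if self.lrb_bottom - n < self.boundary:
            self._repartition()
        if self.lrb_bottom - n < self.boundary:
            raise MemoryError("shared window exhausted for LRB")
        self.lrb_bottom -= n
        return self.lrb_bottom
```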

6. Experimental Evaluation 

We used a heterogeneous platform simulator for measuring the 
performance of different workloads on our programming 
environment. It simulates a modern out-of-order CPU and a LRB 
like system. The CPU simulation uses a memory and architecture 
configuration similar to that of the Intel Core2Duo processor. The 
LRB-like system was simulated as a discrete PCI-Express device 
with the interconnect latency and bandwidth similar to PCI-Express 
2.0. The LRB-like system only simulated the Pentium instruction 
set. It did not simulate the new Larrabee instructions and parts of 
the Larrabee memory hierarchy such as the interconnect. For 
brevity, in the rest of this section, we use LRB to denote the LRB- 
like system. The simulator ran a production quality SW stack on the 
two processors. The CPU ran Windows Vista, while LRB ran a 
lightweight operating system kernel. 

We used a number of well known parallel non-graphics 
workloads[6] to measure the performance of our system. These 
include the Black Scholes financial workload that does option 
pricing using the Black Scholes method; the FFT (fast Fourier 
transform) workload that does a radix-2 FFT algorithm used in 
many domains such as signal processing; the Equake workload that 
is part of SpecOMP and performs earthquake modeling and is 
representative of HPC applications; Art which is also part of 
SpecOMP and performs image recognition. The reported numbers 
are based on using the standard input sets for each of the 
applications. All these workloads were rewritten using our 
programming constructs and compiled with our tool chain. 




Figure 2: Percent of shared data in memory 

Figure 2 shows the fraction of total memory accesses that were to 
shared data in the above workloads. A vast majority of the accesses 
were to private data. Note that read-only data accessed by multiple 
threads was privatized manually. This helped in certain benchmarks 
like Black Scholes. It is not surprising that most of the accesses are 
to private data since the computation threads in the workloads 
privatize the data that they operate on to get better memory locality. 
We expect workloads that scale to a large number of cores to 
behave similarly since the programmer must be conscious of data 
locality and avoid false sharing to get good performance. The 
partial virtual address sharing memory model lets us leverage this 
by cutting down on the amount of data that needs to be kept 
coherent. 

We next show the performance of our system on the set of 
workloads. We ran the workloads on a simulated system with 1 
CPU core and varied the number of LRB cores from 6 to 24. The 
workload computation was split between the CPU and LRB cores 
with the compute intensive portions executed on LRB. For 
example, all the option pricing in Black Scholes, and the earthquake 
simulation in Equake is offloaded to LRB. We present the 
performance improvement relative to a single CPU and LRB core. 
Figure 3 compares the performance of our system when the 
application does not use any ownership calls to the case when the 
user optimizes using ownership calls. The bars labeled 
"Mine/Yours" represent the performance with ownership calls 
(Mine implies pages were owned by CPU and Yours implies owned 
by LRB). The bars labeled "Ours" represent the performance 
without any ownership calls. 



[Figure 3 panels: FFT Speedup, ART Speedup, BlackScholes Speedup, Equake Speedup. Each panel plots speedup (y-axis) for 1 CPU + 6, 12, and 24 LRB cores (x-axis), with one bar for Mine/Yours and one for Ours.] 



Figure 3: Ownership performance comparison 



As expected, the applications perform better with ownership calls 
than without. To understand the reason for this, we broke down the 
overhead of the system when the application was not using any 
ownership calls. Figure 4 shows the breakdown for Black Scholes. 
We show the breakdown for only one benchmark, but the ratios of the 
different overheads are very similar in all the benchmarks. 

We break up the overhead into 4 categories. One relates to 
handling the page faults since we use a virtual memory based shared 
memory implementation and reads/writes to a page after an acquire 
point trigger a fault. The second relates to the diff operation performed 
at release points to sync up the CPU and LRB copies of a page. The 
third is the amount of time spent in copying data from one side to the 
other. The copy operation is triggered from the page fault handler 
when either processor needs the latest copy of a page. We do not 
include the copy overhead as part of the page fault overhead, but 
present it separately since we believe different optimizations can be 
applied to optimize it. 

Finally, we show the overhead spent in synchronizing messages. 
Note that in a discrete setting LRB is connected to the CPU over 
PCI-Express. The PCI-Express protocol does not include atomic 
read-modify-write operations. Therefore we have to perform some 
synchronization and hand shaking between the CPU and LRB by 
passing messages. 

When the application uses ownership of arenas, the diff overhead 
is completely eliminated. The page fault handling is reduced since the 
write page fault handler does not have to create a backup copy of the 
page. Moreover, since we copy all the pages at one shot when we 
acquire ownership of an arena, we don't incur read page faults 
anymore. This also significantly reduces the synchronization message 
overhead since the CPU and LRB perform the handshaking at only 
ownership acquisition points rather than at many intermediate points 
(e.g. whenever pages are transferred from one side to the other). 
Figure 5 shows the overhead breakdown with ownership calls. 

Figure 4: Overhead breakdown without ownership 

Figure 5: Overhead breakdown with ownership 

Figure 6: Overall performance comparison 

Finally, Figure 6 shows the overall performance of our system. 
All the workloads used the ownership APIs. The "ideal" bar 
represents HW supported cache coherence between the CPU and 
LRB cores - in other words this is the best performance that our 
shared memory implementation can provide. For Equake, since the 
amount of data transferred is very small compared to the 
computation involved, we notice that "ideal" and discrete times are 
almost identical. 

In all cases our shared memory implementation has low 
overhead and performs almost as well as the "ideal" case. Black 
Scholes shows the comparatively highest overhead since it has the 
lowest compute density, i.e., the amount of data transferred per unit 
computation time was the highest. In Black Scholes we transfer 
about 13MB of data per second of computation time, while we 
transfer about 0.42MB of data per second of computation time in 
Equake. Hence the memory coherence overhead is negligible in 
Equake. The difference between the "ideal" scenario and our shared 
memory implementation increases with the number of cores mainly 
due to synchronization overhead. In our implementation, 
synchronization penalty increases non-linearly with the number of 
cores, which would not be the case in HW based coherence. 

7. Related work 

The systems most closely related to this paper are the CUDA [11], 
OpenCL [12] and CTM [2] programming environments. Like us, 
OpenCL uses a weakly consistent shared memory model but this is 
restricted to the GPU. 

Our work differs from CUDA, OpenCL and CTM in several 
ways: unlike these environments we define a model for CPU-LRB 
communication, we provide direct user level CPU-LRB 
communication, and we consider a bigger set of C language features 
such as function pointers. Implementing a similar memory model is 
challenging on current GPUs. (This is due to the limitations of 
current GPUs and also because the architecture and low-level 
details about these platforms are not documented.) For example, the 
programmer can not use dynamic pointer containing data structures 
and would be restricted to simple serialized data structures such as 
arrays and vectors. The release and acquire points would perform a 
memory copy of the data with the programmer explicitly specifying 
the data to be copied back and forth. 

The Cell processor [8] is another heterogeneous platform. While 
the PPU is akin to a CPU, the SPUs are much simpler than LRB. 
For example, they do not run an operating system kernel. Unlike the 
SPU-PPU pair, the LRB-CPU pair is much more loosely coupled 
since LRB can be paired as a discrete GPU with any CPU running 
any operating system. Unlike our model, Cell programming 
involves explicit DMA between the PPU and SPU. Our memory 
model is similar to that of PGAS languages [14][17], and hence our 
language constructs have similarity to UPC [17]; however, UPC 
does not consider ISA or operating system heterogeneity. Higher 
level PGAS languages such as X10 [14] do not support the 
ownership mechanism which is crucial for a scalable coherence 
implementation in a discrete scenario. Our implementation has 
similarities to software distributed shared memory [9] [4] which also 
leverage virtual memory. Many of these S-DSM systems also use 
release consistency and copy pages lazily on demand. The main 
difference from S-DSM systems is the level of heterogeneity. 
Unlike S-DSM systems we need to consider a computing system 
where the processors have different ISAs and system environments. 
In particular, we need to support different processors with different 
virtual to physical page mappings. Finally, the performance 
tradeoffs are different since S-DSMs were meant to scale on large 
clusters, while CPU-LRB systems should remain small scale 
clusters for some time in the future. The CUBA [7] architecture 
proposes hardware support for faster communication between the 
CPU and GPU. But the programming model assumes that the CPU 
and GPU are separate address spaces. The EXO [18] model 
provides shared memory between a CPU and accelerators, but it 
requires the page tables to be kept in sync which is infeasible in a 
discrete accelerator. 

8. Conclusions 

Heterogeneous computing platforms composed of a general 
purpose scalar oriented CPU and throughput oriented cores (e.g. a 
GPU) are increasingly being used in client computing systems. 
These platforms can be used for accelerating highly parallel 
workloads. There have been several programming model proposals 
for such platforms, but none of them address the CPU-GPU 
memory model. In this paper we propose a new programming 
model for a heterogeneous x86 system with the following key 
features: we propose a shared memory model for all the cores in the 
platform; we propose a uniform programming model for different 
configurations; and we propose user annotations to demarcate code 
for CPU and LRB execution. We implemented the full software 
stack for our programming model including compiler and runtime 
support. We ported a number of parallel workloads to our 
programming model and evaluated the performance on a 
heterogeneous x86 platform simulator. We show that our model can 
be implemented efficiently so that programmers are able to benefit 
from shared memory programming without paying a performance 
penalty. 